6.13. Refresher on statistical inference¶
Where do we get \(p\)-values from? What is a distribution? Where do we get distributions from?
Let’s start by defining a distribution. In statistics, we use the term distribution to describe the occurrence of a random variable. Say we were measuring plant height for a particular species. Plant height will be a “continuously distributed” random variable, meaning heights can assume an infinite number of values. If we were to plot those heights along an x-axis, then as we continued to obtain new data points we would eventually have to place points on top of each other. If we continued this sampling and plotting, we would see a shape emerge (e.g. Distribution of plant heights). This shape can be thought of as the distribution of the plant height random variable. In this hypothetical example, we have obtained our distribution from the data itself.
In bioinformatics, if we’re lucky, the random variable of interest belongs to a known distribution that is described by a mathematical function. That function specifies the probability of observing any specific value. Examples include the Normal (or Gaussian) distribution, Gamma distribution and, of particular use to the task of understanding \(p\)-values, the uniform distribution.
6.13.1. Hypothesis testing and the \(p\)-value¶
I’m going to present this principally by focussing on the null hypothesis (denoted \(H_0\)). This corresponds most closely with the approach for data analysis advocated by Fisher (see [Per15]).
In this instance, our null corresponds to some notion of how we think the data may be distributed. Let’s imagine we have the complete genome sequences of two different species. We could pose the null hypothesis that the abundance of the different nucleotides in the sequences will be the same. To quantify this, we select a statistical procedure for comparing count data (e.g. a chi-square test) and decide, a priori, on our significance threshold (we will choose 0.05 for now, but more on this later). We compute our test-statistic (a \(\chi^2\) in this hypothetical case) and “look up” the corresponding \(p\)-value. Say our hypothetical analysis of nucleotide counts returned a \(\chi^2=7.9\). Then, given 3 degrees of freedom, our \(p\)-value is \(\approx 0.048<0.05\).
So what does this \(p\)-value correspond to? First, it’s actually a cumulative probability attained by summing the tail of the \(\chi^2_3\) distribution for all values \(\ge 7.9\). So a \(p\)-value is not a point probability (that would be the probability of observing a value of precisely \(7.9\)). Another point worth making here is that the \(p\)-value is that expected under the idealised theoretical (mathematical) distribution 1.
- 1
Below I provide a brute-force demonstration that \(p\)-values derive from the null distribution. I do this by showing a computational technique for estimating a \(p\)-value from a generated null distribution. I further demonstrate, for that case, that the estimated \(p\)-value is extremely close to the one obtained by using the theoretical distribution.
Importantly, a \(p\)-value corresponds to the cumulative probability of observing data that give an equivalent or larger \(\chi^2\) by chance (from the theoretical distribution) when the null is true. (The latter is often phrased as “under the null”.) Even the rather extreme value 30.66 is possible under the null but, with a \(p\)-value of \(\approx 1\times10^{-6}\), we can expect to encounter data delivering such a statistic extremely rarely.
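The tail probabilities quoted above can be checked directly. A minimal sketch, assuming scipy is available (it is not otherwise used in this chapter):

```python
from scipy.stats import chi2

# the p-value is the cumulative tail probability P(X >= x)
# for a chi-square distribution with 3 degrees of freedom
print(chi2.sf(7.9, df=3))    # close to 0.048
print(chi2.sf(30.66, df=3))  # on the order of 1e-6
```

`sf` (the survival function) is `1 - cdf`, i.e. exactly the tail sum described above.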
A significance threshold is a decision point that you choose. It is a cutoff for you to decide when to reject the null hypothesis as an acceptable explanation for the data. In the Fisherian approach, which does not formalise the role of an alternate hypothesis, this is a sufficient basis for pursuing additional experimental avenues. The \(p\)-value cannot, however, be proof that the null is wrong. (Given where the \(p\)-value comes from, I hope that the logical fallacy of such a conclusion is obvious.)
The notion of an alternate hypothesis (denoted \(H_a\)) originates from the framework of Neyman-Pearson. There are overlaps in outcome with the approach of Fisher, but significant differences too (again, see [Per15] for an accessible summary). You will also encounter the Neyman-Pearson approach in the material here.
6.13.2. The uniform distribution¶
We now turn our attention to a description of the uniform distribution which we will use to introduce a measure related to the \(p\)-value as a cumulative measure of probability density.
Consider a random variable that can obtain any value in [0, 1] (including the boundaries, see uniform distribution histogram). We call such a random variable uniformly distributed if all possible values of that random variable have an equal probability of occurring. The probability of observing a value of 0.2 equals that of observing 0.8, 0.9, or 0.0.
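Such a sample is easy to generate with numpy. A sketch of how the x_uniform sample used below could be produced (the original sample was generated earlier in the chapter; the seed here is an arbitrary choice for reproducibility):

```python
from numpy.random import default_rng

rng = default_rng(seed=42)  # arbitrary seed, for reproducibility only
# draw 50,000 values uniformly from [0, 1)
# (numpy's implementation uses a half-open interval)
x_uniform = rng.uniform(low=0.0, high=1.0, size=50_000)
```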
6.13.2.1. Quantiles as distribution descriptors¶
Quantiles are rank order statistics. They are locations in a sorted collection of values. One example of a quantile you are likely familiar with is the median, which cuts a distribution such that 1/2 of all values are less than it. Following this example, the quantile 0.05 is the point that is greater than 1/20th of all values. We can think of a value’s quantile, then, as its relative rank within a data set, which can be computed as \(\frac{r}{n}\) where \(r\) is the value’s rank among \(n\) values.
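The relative-rank definition can be made concrete in a few lines of Python (the data here are hypothetical):

```python
data = [0.9, 0.2, 0.4, 0.1, 0.7]

# a value's quantile is its rank r divided by the number of values n
ranked = sorted(data)
r = ranked.index(0.4) + 1  # 1-based rank of 0.4 in the sorted data
print(r / len(data))  # -> 0.6, i.e. 0.4 sits at the 0.6 quantile
```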
Let’s play with the quantiles from the uniform distribution that I generated above. We use the numpy.quantile function for this purpose. Since we’re using a uniform distribution, and following from the definition of this distribution, we can expect that 5% of all uniform random values will be \(\le 0.05\). Does our data support this?
from numpy import quantile
quantile(x_uniform, 0.05)
0.05009220377271389
Conversely, we expect that 5% of all uniform random values will be \(\ge 0.95\)
1 - quantile(x_uniform, 0.95)
0.04950840125157163
We generated these data using a sample size of 50,000. As we increase that sample size, you will find the estimates of the quantiles from the uniform distribution converge on their expected values. We can generalise this statement further: as you increase the sample size, the quantile becomes an increasingly good approximation of its \(p\)-value.
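A quick sketch of this convergence (the seed is an arbitrary choice for reproducibility):

```python
from numpy import quantile
from numpy.random import default_rng

rng = default_rng(seed=1)  # arbitrary seed, for reproducibility only
for n in (1_000, 10_000, 1_000_000):
    x = rng.uniform(size=n)
    # the estimated 0.05 quantile converges towards 0.05 as n grows
    print(n, quantile(x, 0.05))
```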
Quantiles have advantages over \(p\)-values in exploratory data analysis, not least of which is that they are derived from the actual data rather than from an idealised (theoretical) description. Numerous exploratory data analysis techniques are based on this quantity (for example, Quantile-Quantile plots for comparing the distributions of two data sets).
6.13.3. Resampling statistics – brute-force generation of null distributions¶
A challenge often encountered in bioinformatics is that a random variable of interest does not follow a known distribution. In these cases, a popular statistical approach is to use so-called resampling approaches.
If they derive from some type of permutation of observed data (as we will do below) then they are often referred to as “non-parametric” methods. Such techniques have value for estimating the confidence interval for a parameter (e.g. jackknife) or estimating a p-value (e.g. permutation tests).
6.13.3.1. Computational approaches – resampling with replacement¶
We now consider a specific problem which we will solve using random sampling with replacement 2.
- 2
To illustrate “with replacement”: we randomly draw an observation from the observed data set and add it to our “resampled” set. We then return the observation to the observed data, so the probability of drawing that specific state never changes. In the alternative approach of resampling without replacement, the probability of drawing a specific state decreases with each subsequent draw of it.
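The distinction can be demonstrated with numpy’s choice function (a sketch; choice is not used elsewhere in this section):

```python
from numpy.random import choice

pool = list("AACCCCGT")

# with replacement: every draw sees the same pool, so states can
# appear more often than they occur in the original data
with_repl = choice(pool, size=len(pool), replace=True)

# without replacement: each observation can be drawn only once,
# so the result is simply a permutation of the pool
without_repl = choice(pool, size=len(pool), replace=False)
print(sorted(without_repl) == sorted(pool))  # -> True
```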
6.13.4. A worked example for estimating a p-value using a resampling statistic¶
We have a DNA sequence and we want to evaluate whether nucleotides occur randomly in the sequence. We will tackle that question by using non-overlapping dinucleotides and assessing whether their frequency is consistent with the frequencies of their constituent nucleotides.
Here’s the sequence we will use.
seq = [
"ATGAAATCCAACCAAGAGCGGAGCAACGAATGCCTGCCTCCCAAGAAGCG",
"CGAGATCCCCGCCACCAGCCGGTCCTCCGAGGAGAAGGCCCCTACCCTGC",
"CCAGCGACAACCACCGGGTGGAGGGCACAGCATGGCTCCCGGGCAACCCT",
"GGTGGCCGGGGCCACGGGGGCGGGAGGCATGGGCCGGCAGGGACCTCGGT",
"GGAGCTTGGTTTACAACAGGGAATAGGTTTACACAAAGCATTGTCCACAG",
"GGCTGGACTACTCCCCGCCCAGCGCTCCCAGGTCTGTCCCCGTGGCCACC",
"ACGCTGCCTGCCGCGTACGCCACCCCGCAGCCAGGGACCCCGGTGTCCCC",
"CGTGCAGTACGCTCACCTGCCGCACACCTTCCAGTTCATTGGGTCCTCCC",
"AATACAGTGGAACCTATGCCAGCTTCATCCCATCACAGCTGATCCCCCCA",
"ACCGCCAACCCCGTCACCAGTGCAGTGGCCTCGGCCGCAGGGGCCACCAC",
"TCCATCCCAGCGCTCCCAGCTGGAGGCCTATTCCACTCTGCTGGCCAACA",
"TGGGCAGTCTGAGCCAGACGCCGGGACACAAGGCTGAGCAGCAGCAGCAG",
]
seq = "".join(seq)
Before we do anything, we need to consider first what our null hypothesis will “look” like and to use that perspective in deciding how we will approach this problem algorithmically. If nucleotides occur randomly within a DNA sequence, we expect that the dinucleotides will consist of randomly drawn nucleotides. Stated another way, we construct a dinucleotide by randomly drawing the first nucleotide from the pool of nucleotides and then drawing the second nucleotide from the same pool of nucleotides. In terms of a probability calculation, we expect the probability of dinucleotide \(i, j\) to be specified as
\[p(i,j)=p(i)\times p(j)\]
where \(p(i,j)\) is the probability of dinucleotide \(i,j\), and \(p(i)\) and \(p(j)\) are the probabilities of nucleotides \(i\) and \(j\) respectively.
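As a small numeric illustration of this product rule, using a short hypothetical sequence:

```python
from collections import Counter

demo = "AACCCCGT"  # a short hypothetical sequence
n = len(demo)
p = {nuc: count / n for nuc, count in Counter(demo).items()}

# under the null, a dinucleotide probability is the product of the
# probabilities of its constituent nucleotides, e.g. p("CG") = p("C") * p("G")
print(p["C"] * p["G"])  # -> 0.0625 (i.e. 0.5 * 0.125)
```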
This is actually the calculation made when we perform a chi-square test for independence, so we will do that here. Let’s use this simple DNA sequence – "AACCCCGT" – to illustrate the steps we need to take in order to be able to compute a chi-square statistic.
Split the sequence into dinucleotides: From our sample sequence, we need to produce the series of dinucleotides
["AA", "CC", "CC", "GT"].

def seq_to_dinucs(seq):
    seq = "".join(seq)
    dinucs = [seq[i: i + 2] for i in range(0, len(seq) - 1, 2)]
    return dinucs

dinucs = seq_to_dinucs("AACCCCGT")
Define a nucleotide order: We need this in order to be able to convert the dinucleotide string into array coordinates. We define nucleotides to be in alphabetical order. This means that the dinucleotide
"AA" corresponds to indices (0, 0) while "GT" corresponds to indices (2, 3).

nucleotide_order = "ACGT"
Convert dinucleotides into pairs of indices: I’ll do this by writing a function that converts a single dinucleotide into coordinates. Applying this to the sample sequence we get
def dinuc_to_indices(dinuc):
    return tuple(nucleotide_order.index(nuc) for nuc in dinuc)

coords = [dinuc_to_indices(dinuc) for dinuc in dinucs]
coords
[(0, 0), (1, 1), (1, 1), (2, 3)]
Use dinucleotide indices to increment counts in a matrix: We will use a numpy array for the counts. Think of the row and column labels for this array as corresponding to the nucleotides present at the first and second positions of a dinucleotide. For our example, we get the following
from numpy import zeros

def make_counts_matrix(coords):
    counts = zeros((4, 4), dtype=int)
    for i, j in coords:
        counts[i, j] += 1
    return counts

observed = make_counts_matrix(coords)
observed
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 0]])

Use those counts to compute the expected values: This can be achieved quite simply here by first generating row and column sums, converting those to frequencies, plus a couple of other steps (detail is below).
from numpy import outer

def get_expected(counts):
    total = counts.sum()
    row_sums = counts.sum(axis=1)
    col_sums = counts.sum(axis=0)
    row_probs = row_sums / total
    col_probs = col_sums / total
    expecteds = outer(row_probs, col_probs) * total
    return expecteds

expected = get_expected(observed)
expected
array([[0.25, 0.5 , 0.  , 0.25],
       [0.5 , 1.  , 0.  , 0.5 ],
       [0.25, 0.5 , 0.  , 0.25],
       [0.  , 0.  , 0.  , 0.  ]])

Generate the chi-square statistic: This is defined as follows
\[\chi^2=\sum_i\frac{(O_i-E_i)^2}{E_i}\]
where \(O_i\) and \(E_i\) correspond to the observed and expected counts for dinucleotide \(i\) and the summation is over all dinucleotides.
We express this as a Python function and apply it to our simple example. (The numpy array operations greatly simplify the calculation.)
def calc_chisq(observed, expected):
    chisq = (observed - expected)**2 / expected
    return chisq.sum()

calc_chisq(observed, expected)
nan
Note
The nan output from calc_chisq() arose because some expected counts were 0, producing a division by zero. So time to switch to using the full sequence now.
Let’s provide a simplified interface to all these function calls such that if we provide our sequence, all the above steps are called and we get back our chi-square statistic.
def chiqsq_independent_nucs(seq):
    dinucs = seq_to_dinucs(seq)
    coords = [dinuc_to_indices(dinuc) for dinuc in dinucs]
    observed = make_counts_matrix(coords)
    expected = get_expected(observed)
    return calc_chisq(observed, expected)

chiqsq_independent_nucs(seq)
22.577013388423254
So that’s nice: we are now able to compute the statistic of interest given our sequence. How do we generate the null? We can generate synthetic data sets consistent with the null by randomly sampling from our actual data. This requires a means for making a random choice of a nucleotide from our observed data. Algorithms for generating pseudo-random numbers are important in scientific computing and, as you might expect, there are numerous choices. (The Python standard library comes with a built-in capability for generating such numbers using a well-regarded algorithm.) In our case, we can just use a shuffle function. Note that shuffle works “in place”, meaning it modifies the data you provide, so we need to convert our sequence into a list.
from numpy.random import shuffle
tmp = list("AACCCCGT")
shuffle(tmp)
tmp
['A', 'C', 'C', 'T', 'C', 'A', 'G', 'C']
Will our functions still work if we give them a list?
chiqsq_independent_nucs(list(seq))
22.577013388423254
Yup!
To recap, we have a function that (given a sequence) returns the chi-square statistic for the independence of the nucleotides at the first and second positions of dinucleotides. We want to generate the null distribution for this statistic so that we can assess how unusual the statistic from the observed data is. We do that by defining how many synthetic replicates we want to generate (we will call this num_reps). Each of these synthetic sequences is generated in accordance with the null (the order of nucleotides is random) and a chi-square statistic computed. We can therefore use the number of these chi-square values from the generated null distribution that are greater than or equal to the chi-square from the observed sequence (we denote this quantity \(k\)) to estimate the \(p\)-value for our observed value as \(\frac{k}{num\_reps}\).
So here’s the final function.
def calc_chisq_pval(seq, num_reps):
    obs_stat = chiqsq_independent_nucs(seq)
    seq = list(seq)
    k = 0
    for i in range(num_reps):
        shuffle(seq)
        chisq = chiqsq_independent_nucs(seq)
        if chisq >= obs_stat:
            k += 1
    return k / num_reps
calc_chisq_pval(seq, 2000)
0.009
If we compare this result to one obtained by explicitly using the chi-square distribution we can see they are very close.
| chisq | df | pvalue |
|---|---|---|
| 22.577 | 9 | 0.0072 |
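The theoretical value in the table can be reproduced directly. A sketch assuming scipy is available; a 4x4 table of counts has \((4-1)\times(4-1)=9\) degrees of freedom:

```python
from scipy.stats import chi2

# tail probability for the observed statistic under the theoretical
# chi-square distribution with 9 degrees of freedom
print(round(chi2.sf(22.577, df=9), 4))  # close to 0.0072
```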
6.13.5. Parametric based simulation¶
This is another simulation-based approach to inference. It differs in an important way from the above: you have a “generating” model. What that means is you have a full probabilistic expression that you can then use to produce synthetic observations.
For the PSSM case, for example, the background model (equiprobable states) is a generating model.
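A minimal sketch of such a generating model, assuming the equiprobable background model just mentioned (the function name here is hypothetical):

```python
from numpy.random import default_rng

rng = default_rng(seed=7)  # arbitrary seed, for reproducibility only

def simulate_seq(length):
    # the generating model: each position is drawn independently,
    # with all four nucleotides equally probable
    return "".join(rng.choice(list("ACGT"), size=length))

synthetic = simulate_seq(600)
print(len(synthetic), set(synthetic) <= set("ACGT"))  # -> 600 True
```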
Citations
- Per15
Jose D. Perezgonzalez. Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6:223, 2015. URL: https://www.frontiersin.org/article/10.3389/fpsyg.2015.00223, doi:10.3389/fpsyg.2015.00223.